Scalable Statistical Methods for Ancestral Inference from Genomic Variation Data

Authors

  • Andrew Chan
  • Andrew Hans Chan
  • Haiyan Huang
Abstract

Scalable Statistical Methods for Ancestral Inference from Genomic Variation Data, by Andrew Hans Chan. Doctor of Philosophy in Computer Science, University of California, Berkeley. Professor Yun S. Song, Chair.

Developments in DNA sequencing technology over the last few years have yielded unprecedented volumes of genetic data. The resulting datasets are indispensable for a variety of purposes, from understanding cancer to answering questions about evolution. Despite the ease with which one can obtain these large quantities of data, extracting meaning from the data remains an open and challenging problem. In this work, we develop statistical methods to infer population genetic parameters from high-throughput sequencing data through the use of coalescent theory, which stochastically models the evolution of DNA from generation to generation. Because closed-form analytic formulas are unknown for many parameters of interest, computational methods such as Markov chain Monte Carlo (MCMC) and sequential importance sampling become particularly relevant.

We develop a method using reversible jump MCMC to infer genome-wide variable recombination rates and apply it to data from two Drosophila melanogaster populations. Our analysis of the results reveals several interesting findings. A systematic search for hotspot regions reveals only a few occurrences along the genome, far fewer than observed in humans. We apply a wavelet analysis to quantify the differences between the recombination maps of the two populations, and find that although there is high variability at fine scales, the recombination maps show general agreement at broad scales. The correlation between various genomic features is also assessed using the wavelet analysis, and we find, in contrast to humans, a correlation between recombination and diversity.

In addition, we describe a particle filtering method to sample genealogies from the posterior distribution.
Particle filtering is a model estimation technique in the family of sequential importance sampling methods. It makes inference possible on a continuous state space where the distributions under consideration are complex enough that exact inference is intractable. The sequentially Markov coalescent, an approximation to the coalescent model in which the Markov property is imposed along the sequence, is used to decompose the likelihood of the data into a product of conditional densities, which allows inference on otherwise intractably long sequences of genomic data.
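
As an illustration of the idea only, the sketch below runs a bootstrap particle filter on a toy one-dimensional Gaussian state-space model. It is not the thesis method: the AR(1) hidden state, the Gaussian observation noise, and every name and parameter (bootstrap_particle_filter, phi, state_sd, obs_sd) are assumptions chosen for brevity, and the hidden state here is a real number rather than a genealogy. What it does show is the structure described above: the state is Markov along the sequence, and the log-likelihood estimate is accumulated as a sum of log conditional densities, one per position.

```python
import math
import random


def normal_logpdf(x, mean, sd):
    """Log density of a Normal(mean, sd^2) distribution evaluated at x."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)


def bootstrap_particle_filter(observations, n_particles=500, phi=0.9,
                              state_sd=1.0, obs_sd=0.5, seed=0):
    """Bootstrap particle filter for a toy model (hypothetical, for illustration).

    Hidden state:  x_t = phi * x_{t-1} + Normal(0, state_sd^2)
    Observation:   y_t = x_t + Normal(0, obs_sd^2)

    Returns (log_likelihood_estimate, filtered_means).
    """
    rng = random.Random(seed)
    # Draw the initial particles from the stationary distribution of the state.
    init_sd = state_sd / math.sqrt(1.0 - phi * phi)
    particles = [rng.gauss(0.0, init_sd) for _ in range(n_particles)]

    log_lik = 0.0
    filtered_means = []
    for t, y in enumerate(observations):
        if t > 0:
            # Propagate every particle through the Markov transition.
            particles = [phi * x + rng.gauss(0.0, state_sd) for x in particles]

        # Weight each particle by the observation density at its state.
        log_w = [normal_logpdf(y, x, obs_sd) for x in particles]
        max_lw = max(log_w)  # subtract the maximum for numerical stability
        w = [math.exp(lw - max_lw) for lw in log_w]
        total = sum(w)

        # The mean unnormalised weight estimates the conditional density
        # p(y_t | y_0, ..., y_{t-1}); summing the logs gives the
        # log-likelihood as a sum over conditional terms.
        log_lik += max_lw + math.log(total / n_particles)

        norm_w = [wi / total for wi in w]
        filtered_means.append(sum(wi * x for wi, x in zip(norm_w, particles)))

        # Multinomial resampling: duplicate high-weight particles, drop the rest.
        particles = rng.choices(particles, weights=norm_w, k=n_particles)

    return log_lik, filtered_means
```

Calling bootstrap_particle_filter([0.1, -0.3, 0.5, 0.2]) returns a log-likelihood estimate together with one filtered state mean per observation. Resampling at every step is the simplest choice; a practical implementation would more likely resample only when the effective sample size drops below a threshold.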

Similar resources

Statistical models for analyzing human genetic variation

Statistical Models for Analyzing Human Genetic Variation by Sriram Sankararaman Doctor of Philosophy in Computer Science and the Designated Emphasis in Computational and Genomic Biology University of California, Berkeley Professor Michael I. Jordan, Chair Advances in sequencing and genomic technologies are providing new opportunities to understand the genetic basis of phenotypes such as disease...

Thesis proposal Learning Ancestral Genetic Processes using Nonparametric Bayesian Models

The recent explosion of genomic data has fueled the long-standing interest in analyzing genetic variation to reconstruct the evolutionary history and ancestral structure of human populations, which can provide essential clues for various medical applications. Although genetic properties such as linkage disequilibrium (LD) and population structure are closely related under a common inheritance proc...

TESS3: fast inference of spatial population structure and genome scans for selection.

Geography and landscape are important determinants of genetic variation in natural populations, and several ancestry estimation methods have been proposed to investigate population structure using genetic and geographic data simultaneously. Those approaches are often based on computer-intensive stochastic simulations and do not scale with the dimensions of the data sets generated by high-throug...

Bayesian inference of fine-scale recombination rates using population genomic data.

Recently, several statistical methods for estimating fine-scale recombination rates using population samples have been developed. However, currently available methods that can be applied to large-scale data are limited to approximated likelihoods. Here, we developed a full-likelihood Markov chain Monte Carlo method for estimating recombination rate under a Bayesian framework. Genealogies underl...

Analysis of Admixed Animals Using Indirect Haplotype Information from Existing Technologies

CHEN-PING FU: ANALYSIS OF ADMIXED ANIMALS USING INDIRECT HAPLOTYPE INFORMATION FROM EXISTING TECHNOLOGIES. (Under the direction of Leonard McMillan.) The use of genotyping and sequencing technologies in genetic studies typically involves inspecting variants defined within a single reference genome. While this definition of genetic variation promotes a simple model of the genome that is easy to ...

Publication date: 2014